Abstract: Is it a good idea to use the frequency of events in
the past as a guide to their frequency in the future (as we all
do anyway)? In this paper the question is attacked from the
perspective of universal prediction of individual sequences. It is
shown that there is a universal sequential probability assignment
such that, for a large class of loss functions (optimization goals), the
predictor minimizing the expected loss under this probability is a
good universal predictor. The proposed probability assignment is
based on randomly dithering the empirical frequencies of states
in the past, and it is easy to show that randomization is essential.
This yields a very simple universal prediction scheme which is
similar to Follow-the-Perturbed-Leader (FPL) and works for a
large class of loss functions, as well as a partial justification for
using probabilistic assumptions.



I. Introduction 

In this paper the problem of universal sequential prediction 
of an individual unknown sequence is considered 00(5), 
and a prediction approach based on universal probability 
assignment is proposed. Given a space of strategies B, a space 
of nature states X and a loss function $l(b, x)$, $b \in B$, $x \in X$, the
purpose is to assign the next strategy $b_t$ given the knowledge
of the past states $x^{t-1}$, such that the overall loss $\sum_{t=1}^{n} l(b_t, x_t)$
would be asymptotically close to the loss obtained by the
best fixed strategy known a posteriori after viewing the entire
sequence $x^n$, i.e. $\min_{b \in B} \sum_{t=1}^{n} l(b, x_t)$.

In the particular case of sequential probability assignment
under the log loss function $l(b, x) = \log \frac{1}{b(x)}$, where B is the
space of probability assignments on the finite alphabet X, or
equivalently in universal sequential compression, it is shown
[?], [?, §13], [1, §9] that it is possible to assign probabilities
$\hat{p}_t(x_t)$ for the next state in an arbitrary sequence of states
$x_t \in X$, $t = 1, 2, \ldots, n$, given the past states, such that for any
possible sequence, the overall probability $\hat{p}(\mathbf{x}) = \prod_{t=1}^{n} \hat{p}_t(x_t)$
would not be too far, in a multiplicative or logarithmic sense,
from the best i.i.d. probability assigned to the sequence a
posteriori, $\max_{p(\cdot)} \prod_{t=1}^{n} p(x_t)$. The result extends to
probabilities assigned by Markov machines or finite state machines
[?]. This problem is related to universal compression because
the overall compression length corresponds to $\log \frac{1}{\hat{p}(\mathbf{x})}$. A
remarkable feature of these universal probability assignments
is that, although nothing is assumed about the sequence, to
construct a universal encoder it is enough to encode as if $\hat{p}_t(\cdot)$
was the true probability of the next state.

These universal probability assignments, such as the
Laplace [?, §13.2] or Krichevsky-Trofimov (KT) [6] estimators,
have an intuitively appealing structure which induces
a small bias over the empirical distribution seen so far. For
example, Laplace's estimate for the probability distribution of
$x_t$ is

$$\hat{p}_t(x) = \frac{N_{t-1}(x) + 1}{(t-1) + |X|} \qquad (1)$$

where $N_t(x)$ denotes the number of times the state x appears
in $x^t$. While these estimators get closer with time to the
measured empirical distribution, they do not "trust" it completely
and, for example, never assign zero probability
to states that have not appeared before. Furthermore, in the
probabilistic prediction setting the same distributions were
shown to perform well not only for the log loss: the predictor
which minimizes the expected loss under these distributions,
$b_t = \arg\min_{b} E_{X \sim \hat{p}_t}\, l(b, X)$, operates well for a wider class of
loss functions [?, §III.A.2].
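To make the bias in (1) concrete, the following minimal Python sketch computes add-constant estimates from a list of past states. The function names are illustrative and not taken from the paper, and the KT variant is assumed here to be the standard add-1/2 rule.

```python
from collections import Counter
from fractions import Fraction


def add_constant_estimate(past, alphabet, bias):
    """Probability of each symbol given the past, with `bias` added to every count."""
    counts = Counter(past)
    total = len(past) + bias * len(alphabet)
    return {x: (counts[x] + bias) / total for x in alphabet}


def laplace(past, alphabet):
    # Laplace's rule as in (1): add one to every count.
    return add_constant_estimate(past, alphabet, Fraction(1))


def krichevsky_trofimov(past, alphabet):
    # Assumed standard KT rule: add one half to every count.
    return add_constant_estimate(past, alphabet, Fraction(1, 2))


past = [0, 0, 1, 0]
print(laplace(past, [0, 1]))
print(krichevsky_trofimov(past, [0, 1]))
```

Both rules give a strictly positive probability to a state that has never appeared, which is the "distrust" of the empirical distribution described above.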

This naturally leads to the following question: is it possible 
to forecast an individual sequence by first generating a proba- 
bility assignment based on the past, and then minimizing the 
expected loss under this assignment (i.e. in a way, acting as 
if future events truly happen with this probability)? Consider 
prediction schemes of the following form: 

1) Generate a probability assignment $P_t^{(u)}(x)$ based on the
past of the sequence $x^{t-1}$, in a way which does not
depend on the loss function.

2) To predict $b_t$ under the loss function $l(b, x)$, choose the
strategy that minimizes the expected loss under $P_t^{(u)}$,
i.e.:

$$b_t = \arg\min_{b} E_{X \sim P_t^{(u)}(\cdot)} \left[ l(b, X) \right] \qquad (2)$$



If there exists a single scheme for generating $P_t^{(u)}(\cdot)$ that does
not depend on the loss function $l(b, x)$, but for which $b_t$ yields
a good (Hannan-consistent [?]) predictor for a certain class
of loss functions, then we call $P_t^{(u)}(\cdot)$ a universal sequential
probability assignment with regard to that class. Notice that
this term has been used in the past with respect to the log-loss,
so the definition above can be considered a natural extension.
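The second step of the scheme, choosing the strategy that minimizes the expected loss under a given assignment as in (2), can be sketched as follows. This is only an illustration under the assumption of a finite list of candidate strategies; `best_response` and the two example losses are hypothetical names.

```python
def best_response(prob, loss, strategies):
    """Return the strategy b minimizing sum_x prob[x] * loss(b, x)."""
    def expected_loss(b):
        return sum(p * loss(b, x) for x, p in prob.items())
    return min(strategies, key=expected_loss)


zero_one = lambda b, x: 0.0 if b == x else 1.0   # number of errors
squared = lambda b, x: (b - x) ** 2              # squared error

prob = {0: 0.7, 1: 0.3}                          # some assignment over X = {0, 1}
grid = [i / 10 for i in range(11)]               # candidate strategies for the squared loss

print(best_response(prob, zero_one, [0, 1]))
print(best_response(prob, squared, grid))
```

The same assignment `prob` serves both losses: under the 0-1 loss the minimizer is the most likely state, while under the squared loss it is the mean of the assignment.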

It is easy to show that, if the class of loss functions includes 
even simple loss functions such as the 0-1 loss (the number 
of errors), then no deterministic assignment can be universal, 
and therefore the Laplace or KT assignments are inadequate. 
However, it is shown in this paper that the random assignment 
obtained by slightly perturbing the empirical frequencies is 
universal for a large class of loss functions, including the log- 
loss and any bounded loss. 
In addition to supplying a simple and general universal prediction
scheme, this result also has interpretations contributing
to our understanding of probability. For example, it supplies 
justification for treating the statistics of a process in the past 
as a guide to its statistics in the future, without having to 
assume the process is indeed stationary, or that it is driven by 
a "probabilistic" law. In other words, if our natural behavior 
is in some way similar to the prediction algorithm described 
here, then the claims on its convergence can be used to justify 
this behavior. 

The next section completes the problem definition and 
discusses the boundaries of the solution, and relations to 
known results. Section III gives the main results, and Section IV
discusses the possible implications on understanding
probabilistic behavior. The proofs are given in Section V.

II. Problem statement and discussion 

Building upon the definitions already presented in the 
introduction, in this section some complementary definitions 
are presented. We assume throughout this paper that X is 
finite (otherwise there is no meaning to measuring empirical 
frequencies). The set of possible strategies B is not restricted. 
The loss function $l(b, x)$ is constant over time.

Let us define the accumulated loss of a sequential predictor as

$$L_n = \sum_{t=1}^{n} l(b_t, x_t) \qquad (3)$$

and the loss of the best fixed strategy as:

$$L_n^* = \min_{b \in B} \sum_{t=1}^{n} l(b, x_t). \qquad (4)$$



The difference $L_n - L_n^*$, which is defined as the regret, is a
function of the predictor and the sequence. The worst case
regret is:

$$\mathcal{R}_{\max} = \max_{x^n} \left( L_n - L_n^* \right), \qquad (5)$$

and the normalized regret is $\frac{1}{n}\mathcal{R}_{\max}$. A forecasting strategy $b_t$
is said to be Hannan-consistent if $\limsup_{n \to \infty} \frac{1}{n}\mathcal{R}_{\max} \le 0$
almost surely (the probability is over the randomization in the 
forecaster if it is random). This means that for large n, the loss 
of the forecaster is essentially at least as small as that of any 
fixed strategy. As mentioned in the introduction, the problem 
addressed in this paper is of finding a sequential probability 
assignment $P_t^{(u)}(\cdot)$ such that the resulting prediction scheme
is Hannan-consistent for a large class of loss functions. 
We will focus mainly on bounding the expected loss (over the 
predictor's randomization), because it also leads to almost- 
sure bounds by applying the strong law of large numbers. The 
maximum expected regret is defined as:

$$\bar{\mathcal{R}}_{\max} = \max_{x^n} E\left[ L_n - L_n^* \right]. \qquad (6)$$



For some loss functions satisfying smoothness conditions
[1, Thm 3.1], [?, Thm 1], the forecasting strategy known as
"Follow the Leader" (FL), which chooses at each time the
best strategy in retrospect,

$$b_t^{(FL)} = \arg\min_{b} \sum_{i=1}^{t-1} l(b, x_i),$$

is Hannan-consistent. Rewriting the above as
$b_t^{(FL)} = \arg\min_{b} \sum_{x \in X} \frac{N_{t-1}(x)}{t-1}\, l(b, x)$, it can be interpreted as an
implementation of (2) where the universal probability assignment
equals the empirical frequencies, $P_t^{(u)}(x) = \frac{N_{t-1}(x)}{t-1}$. In other
words, for this family of loss functions, there is a simple
solution for $P_t^{(u)}(\cdot)$, namely the empirical distribution. However,
this class of loss functions, where FL is universal, is rather
limited.

For a probability assignment to be "general" enough, one
would want to cover, at the least, the family of discrete-strategy,
discrete-state loss functions presented by Hannan
[7]. For this family, the loss function can be represented
by a general $|B| \times |X|$ matrix specifying the loss for each
strategy and each state of nature. It is well known [1, §4]
and straightforward to see that randomization is required in
order to cover this class: consider the 0-1 loss case, i.e. binary
sequences $X = B = \{0, 1\}$ with $l(b, x) = \mathrm{Ind}(b \ne x)$, where
the total loss is the number of errors. For this loss function,
no deterministic predictor yields Hannan-consistency, because
for each deterministic predictor there exists a sequence which
fails the predictor completely, by choosing the next outcome
as the opposite of the predictor's choice, while the loss of the
best fixed predictor is at most n/2. Because a deterministic
$P_t^{(u)}(\cdot)$ inevitably leads to a deterministic predictor (2), this
implies a random $P_t^{(u)}(\cdot)$ is required, in general.

For the binary 0-1 loss problem, Feder, Merhav and Gutman
[?] used a small dither when the empirical probability is close
to 1/2, which effectively avoids a decision when the frequencies
of 0, 1 are nearly equal.^1 For this specific problem, the optimal
solution (in the sense of minimax regret) is known exactly
and was presented by Cover [?]. While the optimal dither in
this problem is different from the straight line used by Feder,
Merhav and Gutman, and is not known in general, this is
of no consequence in the current problem, as we are only
considering Hannan consistency. This solution, as well as the
small bias from the empirical distribution which is required
in the log-loss problem (1), motivates the following choice of
$P_t^{(u)}(\cdot)$: add a small dither to $N_{t-1}(x)$ (the counts of events
in the past) and re-normalize. As shown below, this solution
achieves Hannan-consistency for any bounded loss function
and for the log loss.

The proposed forecaster is reminiscent of the scheme termed
"Follow the Perturbed Leader" (FPL), originally proposed by
Hannan [7], in which the decision is obtained by adding
a small dither to the accumulated loss of every reference
strategy and then choosing the best one. Indeed, dithering
the frequencies is similar, but not equivalent, to dithering the
accumulated losses, and our proof technique for the bounded
loss case borrows from Kalai and Vempala's [10]. Following
this similarity we term the scheme proposed here "Follow
the Perturbed Frequency" (FPF). Notice, however, that FPL
is defined, in general, only when the number of strategies is
finite, while FPF is defined, in general, only when the number
of outcomes (states) is finite, and does not have to assume
the number of strategies is finite. On the other hand, FPL can
deal with more general forms of the problem, including
time-varying loss functions.

1 It is interesting to note that for the 0-1 loss problem their forecaster
is equivalent to a "Follow the Perturbed Leader" forecaster with a uniform
distribution (see below) and also equivalent to the forecaster proposed here.
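The argument that no deterministic predictor can be Hannan-consistent under the 0-1 loss is easy to check numerically. The sketch below builds the adversarial sequence for one particular deterministic rule, a majority vote with ties broken towards 0; the rule and all names are chosen here purely for illustration.

```python
def adversarial_sequence(predictor, n):
    """Build the binary sequence that always contradicts a deterministic predictor."""
    past = []
    for _ in range(n):
        guess = predictor(past)
        past.append(1 - guess)      # nature picks the opposite outcome
    return past


def majority_vote(past):
    """Deterministic follow-the-leader for the 0-1 loss (ties broken towards 0)."""
    return 1 if 2 * sum(past) > len(past) else 0


n = 1000
seq = adversarial_sequence(majority_vote, n)
predictor_errors = sum(majority_vote(seq[:t]) != seq[t] for t in range(n))
best_fixed_errors = min(sum(seq), n - sum(seq))
print(predictor_errors, best_fixed_errors)
```

By construction the deterministic rule errs at every step, while the better of the two constant predictions errs at most n/2 times, so the regret grows linearly with n.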

The problem considered here is a close relative of the
calibration problem [1, §4.5], i.e. the problem of estimating,
from an individual sequence, probability forecasts that pass
certain consistency tests. The problems are related in that, in
both cases, it is shown possible to generate, from empirical data
collected from an individual sequence, probability assignments
that appear to operate as well as forecasts which are based on
knowledge of the "true" statistical model. Also, randomization
is essential in both cases. However, neither problem is
a special case of the other: the probability assignment shown
here is not necessarily calibrated, and a calibrated probability
assignment does not necessarily satisfy the requirements of
the current problem.^2

In this paper, in order to simplify matters, only fixed 
strategies are considered. As one of our motivations is to 
rationalize the behavior of learning probabilities from the past, 
it is enough to consider fixed strategies in order to see the 
advantage of this behavior. The extension to dynamic reference 
strategies is unfortunately not immediate as in the setting of 
prediction with expert advice [1, §2], where dynamic strategies
can be turned into fixed ones by simple enumeration (i.e. 
replacing the strategy with the index of the strategy), because 
we explicitly assume a fixed loss function. However in some 
cases, the core of the prediction problem lies in competing 
with fixed strategies. For example, reference strategies defined 
by states (such as Markov predictors or finite state machines), 
can be considered as fixed strategies in each sub-sequence 
belonging to the same state. 

III. Main results 

Let $N_t(x)$ be the number of times a specific x occurred in the
sequence up to and including time t. The universal sequential
probability assignment is defined as:

$$P_t^{(u)}(x) = c_t^{-1} \left( N_{t-1}(x) + h_t \cdot u_t(x) \right) = \frac{N_{t-1}(x) + h_t \cdot u_t(x)}{t - 1 + h_t \cdot \sum_{x' \in X} u_t(x')} \qquad (7)$$

where $c_t = \sum_{x \in X} \left( N_{t-1}(x) + h_t \cdot u_t(x) \right)$ is the normalizer
guaranteeing unit sum, $u_t(x) \sim U[0, 1]$ is a random dither
which is assumed to be uniformly distributed and i.i.d. over
different x and t (dependence over t does not affect the
expected regret), and $h_t$ is a non-decreasing positive sequence.
Our philosophical considerations (i.e. justifying probabilistic 
behavior) motivate keeping ht as general as possible rather 
than finding a specific optimal sequence ht for each problem. 
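As a rough illustration of the assignment in (7), the following Python sketch dithers a table of past counts and re-normalizes it. The function name `fpf_assignment` and the dictionary representation of the counts are assumptions made for this example only.

```python
import random


def fpf_assignment(counts, h_t, rng=random):
    """Dithered, re-normalized counts: (N(x) + h_t * u(x)) / c_t with u(x) ~ U[0, 1].

    `counts` must list every state of the alphabet, including those with count 0.
    """
    dithered = {x: n_x + h_t * rng.random() for x, n_x in counts.items()}
    c_t = sum(dithered.values())          # normalizer guaranteeing unit sum
    return {x: v / c_t for x, v in dithered.items()}


rng = random.Random(0)
print(fpf_assignment({0: 3, 1: 1}, h_t=2.0, rng=rng))
print(fpf_assignment({0: 3, 1: 1}, h_t=0.0, rng=rng))   # no dither: empirical frequencies
```

With `h_t = 0` the assignment reduces to the empirical frequencies, and a larger `h_t` moves it further away from them.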

The FPF predictor, for any loss function $l(b, x)$, is defined
by:

$$b_t^{(FPF)} = \arg\min_{b \in B} E_{X \sim P_t^{(u)}(x)} \left[ l(b, X) \right] \qquad (8)$$
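A small simulation may help to see the predictor in (8) at work. The sketch below runs FPF with the 0-1 loss on an alternating binary sequence, which defeats deterministic majority voting; the helper names, the seed and the choice $h_t = h_1 t^{\alpha}$ with $h_1 = 1$, $\alpha = 1/2$ are assumptions of this example, not prescriptions of the paper.

```python
import random


def fpf_predict(counts, h_t, loss, strategies, rng):
    """One FPF step: dither the counts, then minimize the expected loss."""
    weights = {x: n + h_t * rng.random() for x, n in counts.items()}
    # The normalizer does not change the argmin, so unnormalized weights suffice.
    return min(strategies, key=lambda b: sum(w * loss(b, x) for x, w in weights.items()))


def run_fpf(sequence, alphabet, loss, strategies, h1=1.0, alpha=0.5, seed=0):
    """Total loss of FPF on `sequence` with h_t = h1 * t**alpha."""
    rng = random.Random(seed)
    counts = {x: 0 for x in alphabet}
    total = 0.0
    for t, x_t in enumerate(sequence, start=1):
        b_t = fpf_predict(counts, h1 * t ** alpha, loss, strategies, rng)
        total += loss(b_t, x_t)
        counts[x_t] += 1            # the state is revealed only after the decision
    return total


zero_one = lambda b, x: 0.0 if b == x else 1.0
seq = [1, 0] * 500
fpf_loss = run_fpf(seq, [0, 1], zero_one, [0, 1])
best_fixed = min(seq.count(0), seq.count(1))
print(fpf_loss, best_fixed, fpf_loss - best_fixed)
```

On this sequence a deterministic majority vote is wrong at every step, whereas the dithered predictor typically stays within a small multiple of $\sqrt{n}$ of the best constant prediction.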



2 Consider for example the 0-1 loss problem, and a sequence containing
an equal number of zeros and ones. Any probability forecaster yielding only
values in the range $0.5 \pm \epsilon$ is $\epsilon$-calibrated, while the decisions based on these
probabilities (when plugged into (2)) can be arbitrary (depending on whether
the probability is smaller or larger than 0.5), and can yield arbitrarily bad (or
good) aggregate losses.



Theorem 1. Assuming $h_t = h_1 \cdot t^{\alpha}$, with $\alpha \in (0, 1)$, the FPF
predictor is Hannan-consistent for any bounded loss function
and for the log-loss. Therefore, under these conditions, $P_t^{(u)}(x)$
defined in (7) is a universal probability assignment for the
class.

This theorem is based on the two following theorems: 

Theorem 2. Assume the loss function is bounded, $|l(b, x)| \le R$.
Then:

1) The expected regret of FPF is upper bounded by

$$\bar{\mathcal{R}}_{\max} \le 2R \sum_{t=1}^{n} h_t^{-1} + 2R |X| h_n \qquad (9)$$

2) Particularly, for any $h_t = h_1 \cdot t^{\alpha}$, with $\alpha \in (0, 1)$, the
normalized expected regret $\frac{1}{n}\bar{\mathcal{R}}_{\max}$ tends to zero with
n.

3) For $h_t = \sqrt{2t/|X|}$: $\bar{\mathcal{R}}_{\max} \le 4R\sqrt{2|X|n}$.
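The behavior of the bound (9) can be explored numerically. The sketch below evaluates its right-hand side for $h_t = h_1 t^{\alpha}$ and compares it with the closed form $4R\sqrt{2|X|n}$, which is assumed here to correspond to $\alpha = 1/2$ and $h_1 = \sqrt{2/|X|}$; the function name and the parameter values are illustrative.

```python
import math


def regret_bound(n, R, alphabet_size, h1, alpha):
    """Right-hand side of (9) for the dither amplitudes h_t = h1 * t**alpha."""
    inv_sum = sum(1.0 / (h1 * t ** alpha) for t in range(1, n + 1))
    h_n = h1 * n ** alpha
    return 2 * R * inv_sum + 2 * R * alphabet_size * h_n


R, size = 1.0, 2
h1 = math.sqrt(2 / size)
for n in (10**2, 10**4, 10**5):
    bound = regret_bound(n, R, size, h1, alpha=0.5)
    closed_form = 4 * R * math.sqrt(2 * size * n)
    print(n, bound / n, closed_form / n)
```

The normalized bound shrinks roughly like $1/\sqrt{n}$, which is the sense in which the regret per time step vanishes.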

Corollary 2.1. The theorem holds under a milder condition,
that the loss function is bounded only for the set of optimizing
strategies, defined as

$$B_{opt} = \left\{ \arg\min_{b \in B} \sum_{x \in X} \lambda(x)\, l(b, x) : \lambda(x) \ge 0, \exists x : \lambda(x) > 0 \right\} \qquad (10)$$

and where $R = \sup_{x \in X, b \in B_{opt}} l(b, x)$. Particularly, the theorem
holds for the L-2 norm loss, $l(\mathbf{b}, \mathbf{x}) = \|\mathbf{b} - \mathbf{x}\|^2$,
for $X \subset \mathbb{R}^d$ ($|X| < \infty$), and $B = \mathbb{R}^d$. In that case
$R = \max_{\mathbf{x}, \mathbf{x}' \in X} \|\mathbf{x} - \mathbf{x}'\|^2$ is the squared diameter of the
set X.
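For the L-2 norm loss, the optimizing strategy for given weights $\lambda(x)$ is the weighted mean of the points of X, so its loss never exceeds the squared diameter. A minimal numeric check, with hypothetical points and weights chosen only for illustration:

```python
def weighted_mean(points, weights):
    """Minimizer over b of sum_x weights[x] * ||b - x||^2, i.e. the center of mass."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(sum(w * p[i] for p, w in zip(points, weights)) / total
                 for i in range(dim))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]     # a finite set X in R^2
weights = [1.0, 2.0, 1.0]                         # lambda(x) >= 0, not all zero
b = weighted_mean(points, weights)
R = max(sq_dist(p, q) for p in points for q in points)   # squared diameter of X
worst_loss = max(sq_dist(b, p) for p in points)
print(b, R, worst_loss)
```

Since the weighted mean lies in the convex hull of X, its squared distance to any point of X is at most R, even though the loss is unbounded over all of $\mathbb{R}^d$.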



Notice that in the most general case without any limitations 
(such as on magnitude), it is generally impossible to devise 
a universal scheme for the norm loss that beats the best 
fixed strategy, i.e. the empirical mean up to a constant, and 
it is made possible in the current problem by the assumption 
that X is finite. 

The proof of Theorem 2 is similar in spirit to the proof
of Kalai and Vempala [10] for the FPL forecaster, as the
perturbation on $N_t(x)$ can be translated to a perturbation on
the accumulated loss.



Theorem 3. For the case of the log-loss, where b(x) is a
probability distribution over X and $l(b, x) = \log \frac{1}{b(x)}$, the
expected regret of FPF satisfies:

1) 



n < V 



\X\ht - 1 



n i 



2) Particularly, for any $h_t = h_1 \cdot t^{\alpha}$, with $\alpha \in [0, 1)$, the
normalized expected regret $\frac{1}{n}\bar{\mathcal{R}}_{\max}$ tends to zero with
n.

3) For constant $h_t$ the expected regret behaves like
$O(\log n)$, and specifically for the choice $h = |X|^{-1}$,
$\bar{\mathcal{R}}_{\max} \le |X| \log(n)$.



All logarithms in this paper are natural logarithms.
Regarding the last case, notice that this redundancy is
similar to the redundancy obtained with Laplace's estimator
and approximately twice the redundancy obtained using
Krichevsky-Trofimov's (which is approximately $\frac{|X|-1}{2} \log n$).
However, notice that the target of the FPF forecaster was not
to produce optimal redundancy for specific loss functions.

The proofs of the theorems stated above appear in Section V
below.
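The factor of about two between the Laplace and KT redundancies can be seen on a simple example. The sketch below measures the log-loss regret of an add-constant estimator on a constant binary sequence; it illustrates a single sequence only, not the worst case, and the function name is hypothetical.

```python
import math


def log_loss_regret(sequence, alphabet_size, bias):
    """Regret (in nats) of the add-`bias` estimator against the best fixed distribution."""
    counts = {}
    loss = 0.0
    for t, x in enumerate(sequence):              # t states have been seen so far
        p = (counts.get(x, 0) + bias) / (t + bias * alphabet_size)
        loss += -math.log(p)
        counts[x] = counts.get(x, 0) + 1
    n = len(sequence)
    # Loss of the best fixed distribution: n times the empirical entropy.
    best = -sum(c * math.log(c / n) for c in counts.values())
    return loss - best


n = 1000
seq = [0] * n
laplace_regret = log_loss_regret(seq, 2, 1.0)
kt_regret = log_loss_regret(seq, 2, 0.5)
print(laplace_regret, kt_regret, math.log(n))
```

For this sequence the Laplace regret equals $\log(n+1)$, while the KT regret is close to $\frac{1}{2}\log n$ plus a constant.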

IV. Implications on the understanding of probability

A. Initial probabilities 

A basic question in the application and philosophy of
probability theory is: where do initial probabilities originate
from (see, e.g. [?])? The fact is that in many situations a
probability distribution is deduced from the relative frequency
of events in the past. While this deduction may be justified
based on some stationarity assumption, it is often used exactly
in those situations where precise analysis of the source of
events is not possible, and therefore the assumption that the
frequency of events in the future would be similar to their
frequency in the past is not necessarily justified. In spite of
this, we often deduce a probability distribution based on past
statistics and use this probability for decision making with
regard to future events. It seems that not only humans but
also animals use this principle [?].

One motivation for the problem posed in Section II, of
searching for a universal probability assignment, is the attempt
to justify this behavior based on mathematical, rather than
physical, assumptions. The theory of universal prediction of
individual sequences, or repetitive games, seems a good framework
for this purpose, because it facilitates deduction from
the past, without assumptions that the past indicates anything
with respect to the future. The existing universal prediction
schemes are less suitable for this purpose since they determine
the next strategy in a contrived way, as a function of the past
frequencies and the loss function, whereas in probability-based
decision making, it is assumed that there exists a single
"true" probability.

The success of the FPF predictor for a large set of loss 
functions, indicates that indeed it is useful to rely on past 
frequencies, and draw from them a "probability" distribution, 
even if the future is arbitrary. The dither may be interpreted 
as the assumption that the future would be similar but not 
identical to the past, and prevents using a too "decisive" 
strategy (such as choosing '0' or '1' in the 0-1 loss case), 
based on a small change in the frequencies. It would be 
farfetched to claim that this is the justification for using 
probabilities: clearly the reason is related to the regularity 
that many natural processes exhibit; however it supplements 
our intuitive understanding by showing that even if these 
assumptions fail, there is still benefit in learning probabilities 
from the past. 

B. Meaning of probability 

In the previous section we tried to justify a specific choice
of a probability. However, probability itself is not a well
defined concept, and many attempts to explain or justify its use
have been made. A good introduction to these philosophical
questions can be found in [?] (for a quick overview see
[13, Chap. ??]). While there is no dispute on the mathematical
axiomatic theory dealing with probability functions, the
meaning of probability and the justification for using it are
questionable.

In a nutshell, the main interpretations of probability are the
relative frequency approach, the a-priori or logical approach, and the
subjectivistic approach. Relative frequency theories interpret
probability as the limiting frequencies in very large groups 
of events (called "collectives"). A-priori theories interpret 
probability as logical relation between sentences, and an 
extension of formal logic: the attributes "true" and "false" 
are represented by probabilities of 1 and 0, and are extended 
by adding a range of probabilities in between. Subjectivistic 
theories interpret probability as a measure of the degree of 
belief of a certain person in a certain proposition, and therefore 
its value is not unique. 

A main issue in all interpretations is what probability 
means with respect to the future. The current results can be 
interpreted under the framework of the subjectivistic theories, 
which view probability as a tool for decision making, i.e. 
probability is just the relative weight that we put on each future 
event when making decisions. Because under subjectivistic 
theories any probability is valid, there is a problem of justi- 
fying any specific choice of a probability assignment, as well 
as the merit of making decisions according to probabilistic 
considerations. 

The current results can be thought of as a partial resolution 
to this question: the suggestion of learning probability from 
the past by biasing or dithering past frequencies, is a good one 
in the sense that it is better than any fixed behavior (and as a 
result, of making decisions according to any fixed probability). 
This demonstrates a clear merit in following probabilistic 
considerations, which is not dependent on any assumptions 
with respect to the real world (the process x t ). 

There are some issues, however, with this interpretation. 
First, the problem setting is limited, compared to our actual use 
of probability. Learning from experience extends far beyond 
the framework of repetitive games and constant loss functions, 
as we usually deduce probabilities from the past and use 
them to solve new problems. Also the fact X is assumed 
discrete is somewhat limiting, although it may be sufficient 
to justify probabilistic intuition, which is fundamentally based 
on distributions on finite sets (such as coins and dice). 

But the main weakness of this interpretation is that it
relies on randomness for generating the universal probability
assignment $P_t^{(u)}$ (and as a result, the claims we can make are
also probabilistic), and so it may lead to a cyclic argument of
explaining probability by using probability. The randomness
used here is in a restricted form of "controlled randomness"
which is generated by the forecaster, i.e. if we believe it is
possible to draw random coins, it is enough for this interpretation
to hold and be meaningful. An alternative assumption
is pseudo-randomness, i.e. assume that we can generate the
dither not randomly, but such that "nature" (drawing the
next $x_t$) cannot guess it, and it appears effectively random.
Unfortunately, like in many other theories, we are not able to
escape some form of "belief" or conjecture with respect to the 
future. 

Another way to avoid the need for randomness is to avoid
problems such as the 0-1 loss case, in which one is forced
to bet, problems that are unsolvable without randomness. For
example, if the loss is convex with respect to the strategy, then,
by Jensen's inequality, the loss when taking the expected value of a
random strategy b is never worse than the expected loss when b is random.
In this case, the forecaster can make a deterministic decision:
replace (8) with $b_t = E\{b_t^{(FPF)}\}$, where the expected value is
with respect to the randomness of $P_t^{(u)}$. This can be thought of
as a different rule for making decisions based on the past: take
as probability the empirical frequencies in the past; however,
when making a decision which changes significantly with
respect to small variations in the probability, take the average
decision over these small variations. This rule is deterministic
and aligns with intuition; however, the restriction to "smooth"
loss functions may be too limiting.
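The convexity argument can be checked on samples. In the sketch below a list of random numbers stands in for the random strategy, and the squared loss is used as an example of a loss that is convex in b; both choices, and the function name, are assumptions made for illustration.

```python
import random


def averaged_vs_random(strategies, x, loss):
    """Loss of the averaged strategy versus the average loss of the random one."""
    mean_b = sum(strategies) / len(strategies)
    loss_of_mean = loss(mean_b, x)
    mean_of_loss = sum(loss(b, x) for b in strategies) / len(strategies)
    return loss_of_mean, mean_of_loss


rng = random.Random(1)
samples = [rng.uniform(0.0, 1.0) for _ in range(10_000)]   # stand-in for a random b_t
squared = lambda b, x: (b - x) ** 2                          # convex in b

loss_of_mean, mean_of_loss = averaged_vs_random(samples, 0.3, squared)
print(loss_of_mean, mean_of_loss)
```

For the squared loss the gap between the two numbers is exactly the sample variance of the strategy, so averaging the decision can only help.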

Another question that would naturally arise with respect to 
this explanation is how it aligns with the fact that, at least in the 
theoretical application of probability theory (e.g. estimation 
theory, communication theory) we do not use dithers in our 
probabilities. It seems that the idea of dithering the probabil- 
ities is a similar notion to the idea of checking sensitivity of 
a given solution to the probabilistic assumptions. In case the 
solution to a given problem does not depend crucially on the 
exact probability values, adding the dither is indeed redundant. 
On the other hand, if the solution depends crucially on a small 
change in the probabilistic assumptions, it may be reasonable 
to doubt its operation in the real world. 

V. Proofs 

A. Proof of Theorem 2

The proof follows the same line of thought as Kalai and
Vempala [10]: first, the regret of a clairvoyant forecaster
using $x_t$ in addition to $x^{t-1}$ is bounded. Then, the difference
in performance between the clairvoyant forecaster and the
proposed forecaster is bounded, by using the fact that some
of the dither works in the same direction as the difference
between them.

1) Definitions: The cumulative loss for playing the constant
strategy b up to time t is $L_t(b) = \sum_{i=1}^{t} l(b, x_i)$. We denote
for brevity by $\mathcal{B}\{L(b)\} = \arg\min_{b} L(b)$ the best strategy for the
cumulative loss function $L(b)$.

The best fixed (a posteriori) strategy is
$\mathcal{B}\{L_n(b)\}$ and has loss $L_n^* = L_n(\mathcal{B}\{L_n(b)\})$. As another
example to clarify the notation, the FL predictor can be written
as $\mathcal{B}\{L_{t-1}(b)\}$, and the FPL predictor [10] can be written as
$\mathcal{B}\{L_{t-1}(b) + p_t(b)\}$, where $p_t(b)$ is a random perturbation.

Let us define the dithered count at time $t-1$ as

$$N_{t-1}^{(u)}(x) \triangleq N_{t-1}(x) + h_t u_t(x), \qquad (12)$$

and the respective dithered accumulated loss as

$$L_{t-1}^{(u)}(b) \triangleq \sum_{x} l(b, x)\, N_{t-1}^{(u)}(x). \qquad (13)$$

This loss could be thought of as the loss during a sequence
which is an extension of the actual sequence with some
random states. Notice the distinction between $L_n$ defined in
(3), which is the loss of the universal predictor, and $L_t^{(u)}$,
which is the accumulated loss whose minimization yields the
predictor. The distribution $P_t^{(u)}(x)$ is proportional to $N_{t-1}^{(u)}(x)$,
and thus the FPF forecaster is equivalent to optimizing the
dithered loss:

$$\begin{aligned}
b_t^{(FPF)} &= \arg\min_{b \in B} E_{X \sim P_t^{(u)}} \left[ l(b, X) \right] \\
&= \arg\min_{b \in B} \sum_{x} P_t^{(u)}(x)\, l(b, x) \\
&= \arg\min_{b \in B} c_t^{-1} \cdot L_{t-1}^{(u)}(b) \\
&= \mathcal{B}\{L_{t-1}^{(u)}(b)\}.
\end{aligned} \qquad (14)$$

Notice that the constant $c_t$ does not affect the minimum.
2) Bounding the expected loss: In terms of the expected
loss $E[L_n] = E\left[ \sum_{t=1}^{n} l(b_t, x_t) \right]$, only the marginal distribution
of $b_t$ matters, and therefore dependence between $u_t(x)$ at
different times does not affect the expected loss. Therefore, in
this section, we assume all $u_t$ are equal, $u_t(x) = u_1(x)$.

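We start by analyzing a clairvoyant predictor which also includes
the state $x_t$ in the prediction. For this purpose, let us
define analogously to (12)-(13):

$$N_t^{(cl)}(x) \triangleq N_t(x) + h_t u_t(x), \qquad L_t^{(cl)}(b) \triangleq \sum_{x} l(b, x)\, N_t^{(cl)}(x) \qquad (15)$$

where for $t = 0$ we define $h_0 = 0$, and note that $N_0(x) = 0$
by definition, and therefore $N_0^{(cl)}(x) = 0$ and $L_0^{(cl)}(b) = 0$. We
consider the loss of the predictor $b_t = \mathcal{B}\{L_t^{(cl)}(b)\}$:

$$\begin{aligned}
\sum_{t=1}^{n} l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t)
&\overset{(a)}{=} \sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( N_t(x) - N_{t-1}(x) \right) \right] \\
&= \sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( N_t^{(cl)}(x) - N_{t-1}^{(cl)}(x) \right) \right] \\
&\quad - \sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( h_t u_t(x) - h_{t-1} u_{t-1}(x) \right) \right]
\end{aligned} \qquad (16)$$

where in (a) we used $N_t(x) - N_{t-1}(x)$ as an indicator function
$\mathrm{Ind}(x_t = x)$. The first part can be bounded as:

$$\begin{aligned}
\sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( N_t^{(cl)}(x) - N_{t-1}^{(cl)}(x) \right) \right]
&= \sum_{t=1}^{n} \left[ L_t^{(cl)}(\mathcal{B}\{L_t^{(cl)}(b)\}) - L_{t-1}^{(cl)}(\mathcal{B}\{L_t^{(cl)}(b)\}) \right] \\
&\overset{(a)}{\le} \sum_{t=1}^{n} \left[ L_t^{(cl)}(\mathcal{B}\{L_t^{(cl)}(b)\}) - L_{t-1}^{(cl)}(\mathcal{B}\{L_{t-1}^{(cl)}(b)\}) \right] \\
&\overset{(b)}{=} L_n^{(cl)}(\mathcal{B}\{L_n^{(cl)}(b)\}) - L_0^{(cl)}(\mathcal{B}\{L_0^{(cl)}(b)\}) \\
&= L_n^{(cl)}(\mathcal{B}\{L_n^{(cl)}(b)\}) \\
&\le L_n^{(cl)}(\mathcal{B}\{L_n(b)\}) \\
&= L_n(\mathcal{B}\{L_n(b)\}) + h_n \sum_{x} l(\mathcal{B}\{L_n(b)\}, x)\, u_n(x) \\
&\le L_n(\mathcal{B}\{L_n(b)\}) + R|X| h_n \\
&= L_n^* + R|X| h_n,
\end{aligned} \qquad (17)$$

where we used (a) the fact that $\mathcal{B}\{L_{t-1}^{(cl)}(b)\}$ is optimized for
$L_{t-1}^{(cl)}$ and (b) the sum of the telescopic series. For the second
sum in (16), let us use the assumption $u_t = u_1$. Then: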



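$$\begin{aligned}
-\sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( h_t u_t(x) - h_{t-1} u_{t-1}(x) \right) \right]
&= -\sum_{t=1}^{n} \sum_{x} \left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x) \left( h_t - h_{t-1} \right) u_1(x) \right] \\
&\le \sum_{t=1}^{n} \sum_{x} R \left| h_t - h_{t-1} \right| \\
&\overset{(a)}{=} |X| R \sum_{t=1}^{n} \left( h_t - h_{t-1} \right) \\
&\overset{(b)}{=} R|X| h_n,
\end{aligned} \qquad (18)$$

where we used (a) the assumption that the sequence $h_t$ is
non-decreasing, and (b) the definition $h_0 = 0$. Combining (18) and
(17) into (16) yields:

$$\sum_{t=1}^{n} l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \le L_n^* + 2R|X| h_n \qquad (19)$$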

The next step is to bound the performance difference
between the clairvoyant predictor $\mathcal{B}\{L_t^{(cl)}(b)\}$ and the FPF
forecaster $b_t^{(FPF)} = \mathcal{B}\{L_{t-1}^{(u)}(b)\}$. The key is that the new
element added to $L_t^{(cl)}(b)$ is $l(b, x_t)$, and the dither element
$u(x_t)$ (i.e. belonging to the state that actually happened at time
t) contributes an offset in the same direction, which cancels
this addition for most values of $u(x_t)$. For this purpose let us
write the accumulated losses as:

$$\begin{aligned}
L_{t-1}^{(u)}(b) &= \sum_{x} l(b, x) \left( N_{t-1}(x) + h_t u_t(x) \right) \\
&= L_c + h_t u(x_t)\, l(b, x_t) \\
L_t^{(cl)}(b) &= \sum_{x} l(b, x) \left( N_t(x) + h_t u_t(x) \right) \\
&= L_{t-1}^{(u)}(b) + l(b, x_t) \\
&= L_c + \left( h_t u(x_t) + 1 \right) l(b, x_t)
\end{aligned} \qquad (20)$$

where we defined

$$L_c = \sum_{x} l(b, x)\, N_{t-1}(x) + \sum_{x \ne x_t} l(b, x)\, h_t u_t(x). \qquad (21)$$

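Noticing that the common part $L_c$ is independent of $u(x_t)$,
we compute the conditional expectation given $L_c$ for each of
the predictors:

$$\begin{aligned}
E\left[ l(\mathcal{B}\{L_{t-1}^{(u)}(b)\}, x_t) \,\middle|\, L_c \right]
&= E\left[ l(\mathcal{B}\{L_c + h_t u(x_t)\, l(b, x_t)\}, x_t) \,\middle|\, L_c \right] \\
&= \int_{v=0}^{1} l(\mathcal{B}\{L_c + h_t\, l(b, x_t)\, v\}, x_t)\, dv \\
&= \int_{v=0}^{1} g(v)\, dv,
\end{aligned} \qquad (22)$$

where we defined for brevity $g(v) \triangleq l(\mathcal{B}\{L_c + h_t\, l(b, x_t)\, v\}, x_t)$, and

$$\begin{aligned}
E\left[ l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \,\middle|\, L_c \right]
&= E\left[ l(\mathcal{B}\{L_c + (h_t u(x_t) + 1)\, l(b, x_t)\}, x_t) \,\middle|\, L_c \right] \\
&= \int_{v=0}^{1} l(\mathcal{B}\{L_c + (h_t v + 1)\, l(b, x_t)\}, x_t)\, dv \\
&= \int_{v=0}^{1} l(\mathcal{B}\{L_c + h_t (v + h_t^{-1})\, l(b, x_t)\}, x_t)\, dv \\
&= \int_{v=h_t^{-1}}^{1+h_t^{-1}} l(\mathcal{B}\{L_c + h_t\, l(b, x_t)\, v\}, x_t)\, dv \\
&= \int_{v=h_t^{-1}}^{1+h_t^{-1}} g(v)\, dv.
\end{aligned} \qquad (23)$$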



The integrands in (22), (23) are equal. Let us temporarily
assume that for $t \ge 1$, $h_t \ge 1$, so that the integration regions
partially overlap. For most of the integration region, because
the integrands are the same (no matter what $l(\cdot, x_t)$ evaluates
to), and the integration regions overlap, they cancel out, and
we remain with the contribution of the edges where there is
no overlap:

$$\begin{aligned}
E\left[ l(\mathcal{B}\{L_{t-1}^{(u)}(b)\}, x_t) - l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \,\middle|\, L_c \right]
&\overset{(22),(23)}{=} \int_{v=0}^{1} g(v)\, dv - \int_{v=h_t^{-1}}^{1+h_t^{-1}} g(v)\, dv \\
&= \int_{v=0}^{h_t^{-1}} g(v)\, dv - \int_{v=1}^{1+h_t^{-1}} g(v)\, dv \\
&\le \int_{[0, h_t^{-1}] \cup [1, 1+h_t^{-1}]} |g(v)|\, dv \\
&\le 2R \cdot h_t^{-1}.
\end{aligned} \qquad (24)$$

Recall that we assumed $h_t \ge 1$. For $h_t < 1$ the bound (24) is
trivially true (because the RHS is at least 2R), and therefore
it holds for all $h_t$.
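The cancellation of the overlapping integration regions can be verified numerically for any bounded integrand. The sketch below uses a midpoint rule and a hypothetical step function in place of $g(v)$; only the bound $|g| \le R$ matters for the comparison with $2R h_t^{-1}$.

```python
def edge_difference(g, h_t, steps=100_000):
    """Midpoint-rule estimate of int_0^1 g(v) dv - int_{1/h_t}^{1+1/h_t} g(v) dv."""
    shift = 1.0 / h_t
    dv = 1.0 / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * dv
        # The second integral is the first one with its argument shifted by 1/h_t.
        total += (g(v) - g(v + shift)) * dv
    return total


R = 1.0
g = lambda v: R if v < 0.37 else -R      # hypothetical integrand, bounded by R
for h_t in (1.0, 4.0, 25.0):
    print(h_t, abs(edge_difference(g, h_t)), 2 * R / h_t)
```

As the dither amplitude grows, the two integration windows overlap more and the difference between the two predictors' expected losses shrinks in proportion to $1/h_t$.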

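Applying the iterated expectations law and accumulating
(24) yields:

$$E\left[ \sum_{t=1}^{n} l(\mathcal{B}\{L_{t-1}^{(u)}(b)\}, x_t) - \sum_{t=1}^{n} l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \right] \le 2R \sum_{t=1}^{n} h_t^{-1} \qquad (25)$$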
which, together with (19), yields:

$$\begin{aligned}
E[L_n] - L_n^*
&= E\left[ \sum_{t=1}^{n} l(\mathcal{B}\{L_{t-1}^{(u)}(b)\}, x_t) - \sum_{t=1}^{n} l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \right]
 + E\left[ \sum_{t=1}^{n} l(\mathcal{B}\{L_t^{(cl)}(b)\}, x_t) \right] - L_n^* \\
&\le 2R \left( \sum_{t=1}^{n} h_t^{-1} + |X| h_n \right) \triangleq \Delta,
\end{aligned} \qquad (26)$$

which proves the first claim of the theorem.

3) Choices of the dither amplitude sequence: There remains
the question of selecting the sequence of dither amplitudes
$h_t$. For a given horizon n, a simple calculation shows
that the best choice in terms of minimizing $\Delta$ is a constant
$h_t$, which equals $\sqrt{n/|X|}$. As will be seen below, a good choice
of a varying $h_t$ that yields an infinite horizon solution (i.e. in
which the setting of $h_t$ does not depend on the horizon n) is
$h_t = \sqrt{2t/|X|}$. However, because in real life we do not choose
an "optimal" $h_t$, it is first desired to show that for a wide range
of choices, the resulting predictor's expected regret tends to
zero. This analysis is rather straightforward and recurs in
many developments of this kind [1, Ex 4.7], [10]. So, let us
choose $h_t = h_1 \cdot t^{\alpha}$, with $\alpha \in (0, 1)$. Then,

$$\begin{aligned}
\sum_{t=1}^{n} h_t^{-1} = h_1^{-1} \sum_{t=1}^{n} t^{-\alpha}
&\le h_1^{-1} \left( 1 + \int_{x=1}^{n} x^{-\alpha}\, dx \right) \\
&= h_1^{-1} \left( 1 + \frac{1}{1-\alpha} \left( n^{1-\alpha} - 1 \right) \right)
\le h_1^{-1} \frac{1}{1-\alpha}\, n^{1-\alpha}
\end{aligned} \qquad (27)$$

Substituting in (26) yields

$$\frac{\Delta}{n} \le 2R \left( h_1^{-1} \frac{1}{1-\alpha}\, n^{-\alpha} + |X| \cdot h_1 n^{\alpha-1} \right). \qquad (28)$$

For any $\alpha \in (0, 1)$ and any $h_1$, this yields $\frac{\Delta}{n} \to 0$ as $n \to \infty$, i.e.
Hannan consistency. It is straightforward to see that the best
choice is obtained by $\alpha = \frac{1}{2}$ and $h_1 = \sqrt{2/|X|}$, which yields:

$$\Delta \le 2R\sqrt{n} \left( 2h_1^{-1} + |X| \cdot h_1 \right) = 4R\sqrt{2|X|n} \qquad (29)$$

4) Proof of Corollary 2.1: To prove the corollary it is
sufficient to notice that all strategies for which the loss is
computed in the proof of Theorem 2 are in the aforementioned
set of optimizing strategies. For the L-2 loss it is easy to see
that the set of optimizing strategies is the convex hull of X
(the strategy for given $\lambda(x)$ can be interpreted as a center of
mass of X with varying weights to the different points).

B. Proof of Theorem 3 (Log loss)


The sequence is $x_t$, $t = 1, \ldots, n$. The accumulated loss for probability $q_t$ is $\sum_{t=1}^{n} \log \frac{1}{q_t(x_t)}$. The best fixed $q_t$ in hindsight is $q_t(x) = P_{\mathbf{x}}(x)$, the empirical distribution of the sequence, and yields $L_n^* = n \sum_x P_{\mathbf{x}}(x) \log \frac{1}{P_{\mathbf{x}}(x)}$. For the universal estimator proposed, $P_t^{(u)}(x) = c_t^{-1} \cdot (N_{t-1}(x) + h_t u_t(x))$, and it is easy to see that given $P_t^{(u)}(x)$, the choice of $q_t(\cdot)$, the probability distribution for the next state, is just

$$
q_t(\cdot) = \arg\min_{q}\; \mathbb{E}_{P_t^{(u)}(x)} \log \frac{1}{q(x)} = P_t^{(u)}(x).
$$

Therefore,

$$
\mathbb{E}[L_n] = \mathbb{E} \sum_{t=1}^{n} \log \frac{1}{P_t^{(u)}(x_t)}
= \sum_{t=1}^{n} \left[ \mathbb{E} \log(c_t) - \mathbb{E} \log\big(N_{t-1}(x_t) + h_t u_t(x_t)\big) \right].
\tag{30}
$$

In general, in order to achieve a small regret for the log loss, it is required that the overall contribution of $P_t^{(u)}(x_t)$ for all occurrences of a certain state $x_t = x$ would approximate $P_{\mathbf{x}}(x)$. However the most important property, which is not satisfied by FL, is not to give a probability too close to 0 for a certain state $x$ on its first appearance in the sequence. I.e. if $N_{t-1}(x_t) = 0$, it is required that $\mathbb{E} \log(N_{t-1}(x_t) + h_t u_t(x_t)) = \mathbb{E} \log(h_t u_t(x_t))$ is finite. Indeed it is easy to verify that this holds.
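The dithered assignment and the finiteness claim for a first appearance can be sketched as follows. This is an illustrative sketch only: the function names, the example counts and the seed are assumptions made for the demonstration, and the dither is drawn from $(0, 1]$ instead of $[0, 1]$ to avoid evaluating $\log 0$ in floating point.

```python
import math
import random


def dithered_assignment(counts, h, rng):
    """Return P(x) = (N(x) + h*u(x)) / c with u(x) i.i.d. uniform on (0, 1]."""
    weights = {x: n + h * (1.0 - rng.random()) for x, n in counts.items()}
    c = sum(weights.values())  # normalizing factor c_t
    return {x: w / c for x, w in weights.items()}


def expected_log_unseen(h):
    """Closed form of E[log(h*U)] for U uniform on [0, 1]: log(h) - 1."""
    return math.log(h) - 1.0


def monte_carlo_log_unseen(h, samples, rng):
    """Monte Carlo estimate of E[log(h*U)], used only as a sanity check."""
    total = sum(math.log(h * (1.0 - rng.random())) for _ in range(samples))
    return total / samples


rng = random.Random(0)
counts = {"a": 3, "b": 0, "c": 1}  # hypothetical N_{t-1}(x); "b" is unseen
p = dithered_assignment(counts, h=2.0, rng=rng)
estimate = monte_carlo_log_unseen(2.0, 200_000, rng)
```

The unseen state "b" receives a strictly positive probability, and the Monte Carlo estimate agrees with the closed form $\log h_t - 1$, which is finite for any $h_t > 0$.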

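Following is the detailed calculation and bounding for $\mathbb{E}[L_n]$. The normalizing factor $c_t$ is bounded as:

$$
c_t = \sum_{x} \big( N_{t-1}(x) + h_t u_t(x) \big) \le t - 1 + |\mathcal{X}|\, h_t .
$$

In the below, denote for conciseness $N_{t-1}(x_t) = v$.

$$
\mathbb{E} \log\big(N_{t-1}(x_t) + h_t u_t(x_t)\big)
= \mathbb{E} \log\big(v + h_t u_t(x_t)\big)
= \int_{0}^{1} \log(v + h_t y)\, dy
= h_t^{-1} \int_{v}^{v + h_t} \log y\, dy
\tag{31}
$$

$$
\begin{aligned}
&= h_t^{-1} \big[ y \log y - y \big]_{v}^{v + h_t} \\
&= h_t^{-1} \big[ (v + h_t) \log(v + h_t) - v \log v - h_t \big] \\
&= h_t^{-1} \left[ h_t \log(v + h_t) + v \log\left(1 + \frac{h_t}{v}\right) - h_t \right] \\
&\overset{(a)}{\ge} h_t^{-1} \left[ h_t \log(v + h_t) + v\, \frac{h_t / v}{1 + h_t / v} - h_t \right] \\
&= \log(v + h_t) - \frac{h_t}{v + h_t}
 = \log\big(N_{t-1}(x_t) + h_t\big) - \frac{h_t}{N_{t-1}(x_t) + h_t}.
\end{aligned}
\tag{32}
$$

Step (a) uses $\log(1 + y) \ge \frac{y}{1 + y}$. In (a) notice that $v = 0$ is a special case: using $0 \log 0 = 0$, it is easy to verify that in this case the expression before (a) equals $\log(h_t) - 1$, and therefore the inequality holds.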
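The closed form in (32) and the lower bound after step (a) can be compared numerically. The sketch below is only a sanity check under stated assumptions: the helper names are hypothetical and the midpoint rule is an arbitrary choice of quadrature.

```python
import math


def expected_log(v, h):
    """Closed form of E[log(v + h*U)], U uniform on [0, 1], as in (32)."""
    if v == 0:
        return math.log(h) - 1.0  # special case, using 0*log(0) = 0
    return ((v + h) * math.log(v + h) - v * math.log(v) - h) / h


def lower_bound(v, h):
    """Right-hand side of (32): log(v + h) - h / (v + h)."""
    return math.log(v + h) - h / (v + h)


def midpoint_expected_log(v, h, steps=20000):
    """Midpoint-rule approximation of the integral of log(v + h*y) over [0, 1]."""
    return sum(math.log(v + h * (k + 0.5) / steps) for k in range(steps)) / steps


# Hypothetical grid of counts v and dither amplitudes h.
grid = [(v, h) for v in (0, 1, 5, 100) for h in (0.5, 1.0, 3.0)]
gaps = {(v, h): expected_log(v, h) - lower_bound(v, h) for v, h in grid}
```

Every entry of `gaps` is non-negative, and it is zero exactly for $v = 0$, where the bound holds with equality.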

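Returning to (30):

$$
\begin{aligned}
\mathbb{E}[L_n]
&= \sum_{t=1}^{n} \left[ \mathbb{E} \log(c_t) - \mathbb{E} \log\big(N_{t-1}(x_t) + h_t u_t(x_t)\big) \right] \\
&\le \sum_{t=1}^{n} \log\big(t - 1 + |\mathcal{X}| h_t\big)
 + \sum_{t=1}^{n} \left[ -\log\big(N_{t-1}(x_t) + h_t\big) + \frac{h_t}{N_{t-1}(x_t) + h_t} \right] \\
&= \sum_{t=1}^{n} \log\big(t - 1 + |\mathcal{X}| h_t\big)
 + \sum_{x \in \mathcal{X}} \sum_{t:\, x_t = x} \left[ -\log\big(N_{t-1}(x) + h_t\big) + \frac{h_t}{N_{t-1}(x) + h_t} \right].
\end{aligned}
\tag{33}
$$

For the second sum, which was broken into the subsequences in which a specific state $x$ appears, notice that $N_{t-1}(x)$ increases by 1 between consecutive elements of the internal sum, so that $N_{t-1}(x) + 1$ in the last element equals the total number of appearances of $x$, $N_n(x)$. At this point, it is beneficial to write $L_n^*$ in a similar form:

$$
L_n^* = n \sum_{x} P_{\mathbf{x}}(x) \log \frac{1}{P_{\mathbf{x}}(x)}
= \sum_{x} N_n(x) \log \frac{n}{N_n(x)}
= n \log n - \sum_{x} N_n(x) \log N_n(x).
\tag{34}
$$

A consequence of the bound on the size of a type class $|T_P| \le \exp(n H(P))$ [14, Lemma II.2] is (considering the type defined by the sequence $\mathbf{x}$, $P = (\ldots, N_n(x)/n, \ldots)$, and taking the log of both sides):

$$
\log \frac{n!}{\prod_{x \in \mathcal{X}} N_n(x)!}
\le n \sum_{x \in \mathcal{X}} \frac{N_n(x)}{n} \log \frac{n}{N_n(x)}
= n \log n - \sum_{x \in \mathcal{X}} N_n(x) \log N_n(x).
\tag{35}
$$

Plugging into (34) yields

$$
L_n^* \ge \log \frac{n!}{\prod_{x \in \mathcal{X}} N_n(x)!}
= \sum_{t=1}^{n} \log t - \sum_{x \in \mathcal{X}} \sum_{m=1}^{N_n(x)} \log(m).
\tag{36}
$$

Notice that this way of bounding $L_n^*$ is slightly non-standard: rather than writing $L_n^*$ in a similar form to $L_n$, it would generally be simpler to write the bound on $\mathbb{E}[L_n]$ using factorials, and simplify it using Stirling's approximation, obtaining a form similar to (34). However, this approach does not hold for varying $h_t$.

Let us now assume $h_t$ is non-decreasing. Combining (33) with (36), and noting that $m = N_{t-1}(x) + 1$ at the time of the $m$-th occurrence of $x$, yields:

$$
\begin{aligned}
\mathbb{E}[L_n] - L_n^*
&\le \sum_{t=1}^{n} \big( \log(t - 1 + |\mathcal{X}| h_t) - \log(t) \big)
 - \sum_{x \in \mathcal{X}} \sum_{t:\, x_t = x} \left[ \log\big(N_{t-1}(x) + h_t\big) - \log\big(N_{t-1}(x) + 1\big) - \frac{h_t}{N_{t-1}(x) + h_t} \right] \\
&= \sum_{t=1}^{n} \log\left( 1 + \frac{|\mathcal{X}| h_t - 1}{t} \right)
 + \sum_{x \in \mathcal{X}} \sum_{t:\, x_t = x} \left[ \log\left( 1 - \frac{h_t - 1}{N_{t-1}(x) + h_t} \right) + \frac{h_t}{N_{t-1}(x) + h_t} \right] \\
&\le \sum_{t=1}^{n} \frac{|\mathcal{X}| h_t - 1}{t}
 + \sum_{x \in \mathcal{X}} \sum_{t:\, x_t = x} \left[ -\frac{h_t - 1}{N_{t-1}(x) + h_t} + \frac{h_t}{N_{t-1}(x) + h_t} \right] \\
&= \sum_{t=1}^{n} \frac{|\mathcal{X}| h_t - 1}{t}
 + \sum_{x \in \mathcal{X}} \sum_{t:\, x_t = x} \frac{1}{N_{t-1}(x) + h_t} \\
&= \sum_{t=1}^{n} \frac{|\mathcal{X}| h_t - 1}{t}
 + \sum_{t=1}^{n} \frac{1}{N_{t-1}(x_t) + h_t},
\end{aligned}
\tag{37}
$$

where the inequality uses $\log(1 + y) \le y$.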



Let us consider the sequence $\mathbf{x}$ that maximizes the second sum. As is clear intuitively, and will be proven below, this sequence selects all states of $\mathcal{X}$ in a round-robin fashion, which minimizes the growth rate of $N_{t-1}(x_t)$ and for which $N_{t-1}(x_t) = \left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor$.

First, for a given type (i.e. for given $\{N_n(x)\}_{x \in \mathcal{X}}$), consider the order that would yield the maximum. It is clear that the $m$-th occurrences of different states (assuming these states indeed occur at least $m$ times) should occur at consecutive $t$-s. In other words, the sequence $N_{t-1}(x_t)$ is non-decreasing. Suppose that the opposite occurs, i.e. $N_{t-2}(x_{t-1}) > N_{t-1}(x_t)$, then obviously $x_t \ne x_{t-1}$. Let us flip the order of these states, i.e. let $x'_t = x_{t-1}$ and $x'_{t-1} = x_t$, then as a result the counts will also flip, $N'_{t-2}(x'_{t-1}) = N_{t-1}(x_t)$ and $N'_{t-1}(x'_t) = N_{t-2}(x_{t-1})$. This is easiest to see via an example: suppose the sequence is $\mathbf{x} = (c, a, a, b, c, c)$, then the counts $N_{t-1}(x_t)$ are $0, 0, 1, 0, 1, 2$. After flipping the states $t = 3, 4$ the sequence is $\mathbf{x}' = (c, a, b, a, c, c)$ and the counts are $0, 0, 0, 1, 1, 2$. The elements pertaining to other times are not affected by this flip, while the sum of the two elements is now:

$$
\begin{aligned}
\frac{1}{N'_{t-2}(x'_{t-1}) + h_{t-1}} + \frac{1}{N'_{t-1}(x'_t) + h_t}
&= \frac{1}{N_{t-1}(x_t) + h_{t-1}} + \frac{1}{N_{t-2}(x_{t-1}) + h_t} \\
&\ge \frac{1}{N_{t-2}(x_{t-1}) + h_{t-1}} + \frac{1}{N_{t-1}(x_t) + h_t},
\end{aligned}
\tag{38}
$$

where the inequality holds because $h_t \ge h_{t-1}$ and $N_{t-2}(x_{t-1}) > N_{t-1}(x_t)$. This can be seen by direct algebraic manipulation, or by using Lemma 1 with the convex function $f(t) = t^{-1}$.

Lemma 1. Let $f$ be a convex-$\cup$ function and $a_0, a_1, b_0, b_1$ satisfy $a_1 \ge a_0$ and $b_1 \ge b_0$, then

$$
f(a_0 + b_0) + f(a_1 + b_1) \ge f(a_0 + b_1) + f(a_1 + b_0).
\tag{39}
$$

I.e. the maximum sum is obtained by pairing the smaller elements together and the bigger elements together.

Proof: Let us assume there is strict inequality at least in one of the pairs, otherwise the result holds trivially. Write the hybrid sums as convex combinations of the homogeneous sums:

$$
\begin{aligned}
a_0 + b_1 &= \lambda (a_0 + b_0) + (1 - \lambda)(a_1 + b_1) \\
a_1 + b_0 &= (1 - \lambda)(a_0 + b_0) + \lambda (a_1 + b_1),
\end{aligned}
\tag{40}
$$

with

$$
\lambda = \frac{a_1 - a_0}{a_1 - a_0 + b_1 - b_0} \in [0, 1].
\tag{41}
$$

From (40) and the convexity of $f$:

$$
\begin{aligned}
f(a_0 + b_1) &\le \lambda f(a_0 + b_0) + (1 - \lambda) f(a_1 + b_1) \\
f(a_1 + b_0) &\le (1 - \lambda) f(a_0 + b_0) + \lambda f(a_1 + b_1).
\end{aligned}
\tag{42}
$$

Summing the two inequalities in (42) yields the desired result.

The conclusion is that the sequence $N_{t-1}(x_t)$ is non-decreasing. Next, it is obvious that to increase the sum, all states in $\mathcal{X}$ should be chosen approximately the same number of times. If at some point a state $x$ has been chosen for the $m$-th time at time $t$, while another state $x'$ had appeared at most $m - 2$ times at this point, then because of the monotonicity of $N_{t-1}(x_t)$, $x'$ can never appear again in an optimal sequence. Clearly, choosing $x'$ instead of $x$ at time $t$ would decrease $N_{t-1}(x_t)$ for time $t$ as well as for all future occurrences of the state $x$, and therefore will increase the sum.

This concludes the proof that the last sum in (37) is maximized by $N_{t-1}(x_t) = \left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor$. Now, (37) may be rewritten as

$$
\mathbb{E}[L_n] - L_n^* \le \sum_{t=1}^{n} \frac{|\mathcal{X}| h_t - 1}{t} + \sum_{t=1}^{n} \frac{1}{\left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor + h_t} \triangleq \Delta.
\tag{43}
$$

The last bound is only a function of $\{h_t\}$ and not of the sequence $\mathbf{x}$. If $h_t$ grows sublinearly, then the dominant factor in the second sum will be $\left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor$ and the sum would grow like $O(\log(n))$. On the other hand, any growth rate of $h_t$ that satisfies $\frac{1}{n} \sum_{t=1}^{n} \frac{h_t}{t} \to 0$ would yield normalized expected regret tending to 0, and a sufficient condition is $\frac{h_t}{t} \to 0$, and particularly this holds for $h_t = O(t^{\alpha})$, $\alpha \in [0, 1)$.

If $h_t$ is constant, then the first sum grows like $O(\log(n))$ as well. A more detailed evaluation, for $h_t = h$ with $|\mathcal{X}| h \ge 1$, yields:

$$
\begin{aligned}
\Delta &= (|\mathcal{X}| h - 1) \sum_{t=1}^{n} \frac{1}{t} + \sum_{t=1}^{n} \frac{1}{\left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor + h} \\
&\le (|\mathcal{X}| h - 1) \left( 1 + \int_{1}^{n} \frac{dt}{t} \right) + \sum_{t=1}^{n} \frac{1}{\left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor + h} \\
&= (|\mathcal{X}| h - 1) \big( 1 + \log(n) \big) + \sum_{t=1}^{n} \frac{1}{\left\lfloor \frac{t-1}{|\mathcal{X}|} \right\rfloor + h}.
\end{aligned}
\tag{44}
$$

This ends the proof of Theorem 3.
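The counting example and the round-robin claim used in this proof can be checked by brute force on a small alphabet. This is a minimal sketch under stated assumptions: the three-letter alphabet, the horizon $n = 6$ and the choice $h_t = \sqrt{t}$ are hypothetical, chosen only so that exhaustive search is cheap.

```python
import itertools
import math


def prior_counts(seq):
    """Return N_{t-1}(x_t): how often each symbol appeared before time t."""
    seen = {}
    out = []
    for s in seq:
        out.append(seen.get(s, 0))
        seen[s] = seen.get(s, 0) + 1
    return out


def second_sum(seq, h):
    """Second sum of (37): sum over t of 1 / (N_{t-1}(x_t) + h_t)."""
    return sum(1.0 / (c + h[t]) for t, c in enumerate(prior_counts(seq)))


def first_sum(n, alphabet_size, h):
    """First sum of (43): sum over t of (|X| h_t - 1) / t."""
    return sum((alphabet_size * h[t - 1] - 1.0) / t for t in range(1, n + 1))


def round_robin_sum(n, alphabet_size, h):
    """Second sum of (43), with N_{t-1}(x_t) = floor((t-1)/|X|)."""
    return sum(1.0 / ((t - 1) // alphabet_size + h[t - 1]) for t in range(1, n + 1))


alphabet = "abc"
n = 6
h = [math.sqrt(t) for t in range(1, n + 1)]  # hypothetical non-decreasing h_t
best = max(second_sum(s, h) for s in itertools.product(alphabet, repeat=n))
delta = first_sum(n, len(alphabet), h) + round_robin_sum(n, len(alphabet), h)
```

Here `prior_counts("caabcc")` reproduces the counts of the example in the text, and the exhaustive maximum `best` coincides with the round-robin value, so `delta` is the bound (43) for this toy setting.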



C. Proof of Theorem 1

Both Theorem 2 and Theorem 3 show that the normalized expected regret tends to 0 with $n$, and it remains to change from claims on expected regret to claims on the almost-sure regret. To prove that the regret tends to 0 almost surely, or more precisely, $\limsup_{n \to \infty} \frac{1}{n}(L_n - L_n^*) \le 0$, using the already established fact that $\limsup_{n \to \infty} \frac{1}{n}(\mathbb{E}[L_n] - L_n^*) \le 0$, it remains to show that $\frac{1}{n}(L_n - \mathbb{E}[L_n]) \to 0$ almost surely.

$$
\frac{1}{n} \big( L_n - \mathbb{E}[L_n] \big) = \frac{1}{n} \sum_{t=1}^{n} \big( l(b_t, x_t) - \mathbb{E}[l(b_t, x_t)] \big)
\tag{45}
$$

To show that the mean above converges to 0 almost surely, we use Kolmogorov's criterion for the applicability of the Strong Law of Large Numbers [13]. The elements of the sequence $\gamma_t = l(b_t, x_t) - \mathbb{E}[l(b_t, x_t)]$ have zero mean, and are independent, because each $b_t$ depends only on the deterministic history of the sequence, and on $u_t(x)$, which are assumed independent. Notice that the $\gamma_t$ are not identically distributed. Kolmogorov's criterion requires that

$$
\sum_{t=1}^{\infty} \frac{\mathrm{Var}(\gamma_t)}{t^2} < \infty.
$$

This holds trivially for bounded loss functions, for which the boundedness of $|\gamma_t|$ yields a constant bound on its variance. Proving that this condition holds for the case of the log loss function is a rather technical calculation which is deferred to Appendix A. □
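For the bounded-loss case the criterion can be illustrated with a short numerical sketch. Everything below is an assumption made for illustration: the bound $R = 1$, the alternating scale of the variables and the seed are hypothetical, and the simulated variables merely stand in for independent, zero-mean, non-identically distributed $\gamma_t$.

```python
import math
import random


def kolmogorov_partial_sum(variances):
    """Partial sum of Var(gamma_t) / t^2 over the given list of variances."""
    return sum(v / (t * t) for t, v in enumerate(variances, start=1))


def running_mean_of_centered(n, rng):
    """Mean of n independent zero-mean variables with alternating scales."""
    total = 0.0
    for t in range(1, n + 1):
        scale = 1.0 if t % 2 else 0.25  # hypothetical non-identical scales
        total += scale * (2.0 * rng.random() - 1.0)
    return total / n


R = 1.0
n = 20000
# With |gamma_t| <= R the variance is at most R^2 for every t.
partial = kolmogorov_partial_sum([R * R] * n)
limit = R * R * math.pi ** 2 / 6.0  # value of the full series with variance R^2
mean = running_mean_of_centered(n, random.Random(1))
```

The partial sums stay below the finite limit $R^2 \pi^2 / 6$, and the simulated mean is close to 0, in line with the Strong Law of Large Numbers.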



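Appendix

A. Completion of the proof of Theorem 1 for the log loss

This appendix completes the proof of Theorem 1 from Section IV-C by showing that for the case of the log loss, the Kolmogorov criterion holds and therefore the normalized regret converges almost surely to the normalized expected regret. Our purpose is to upper bound the following variance:

$$
\mathrm{Var}(\gamma_t) = \mathrm{Var}\big(l(b_t, x_t)\big) = \mathrm{Var}\left[ \log\left( \frac{1}{P_t^{(u)}(x_t)} \right) \right]
\tag{46}
$$

and show that Kolmogorov's criterion $\sum_{t=1}^{\infty} \frac{\mathrm{Var}(\gamma_t)}{t^2} < \infty$ holds. $P_t^{(u)}(x)$ was defined as

$$
P_t^{(u)}(x) = \frac{N_{t-1}(x) + h_t \cdot u_t(x)}{t - 1 + h_t \cdot \sum_{x' \in \mathcal{X}} u_t(x')}.
\tag{47}
$$

We have:

$$
\begin{aligned}
\log\big(P_t^{(u)}(x_t)\big)
&= \log\big(N_{t-1}(x_t) + h_t \cdot u_t(x_t)\big) - \log\Big(t - 1 + h_t \cdot \sum_{x' \in \mathcal{X}} u_t(x')\Big) \\
&= \log\left( \frac{N_{t-1}(x_t)}{h_t} + u_t(x_t) \right) - \log\left( \frac{t-1}{h_t} + \sum_{x' \in \mathcal{X}} u_t(x') \right).
\end{aligned}
\tag{48}
$$

Using

$$
\mathrm{Var}(A - B) \le \mathbb{E}\big[(A - B)^2\big] \le 2\mathbb{E}\big[A^2\big] + 2\mathbb{E}\big[B^2\big],
\tag{49}
$$

where the second inequality stems from $(a - b)^2 = 2a^2 + 2b^2 - (a + b)^2 \le 2a^2 + 2b^2$, it is enough to bound the expected squared value of each log in (48) separately. For a uniform r.v. $U \sim U[0, 1]$ and a constant $a \ge 0$, the following bound holds:

$$
\begin{aligned}
\mathbb{E}\big[\log^2(a + U)\big]
&= \int_{0}^{1} \log^2(a + u)\, du = \int_{a}^{a+1} \log^2(u)\, du \\
&= \int_{a}^{\max(a,1)} \log^2(u)\, du + \int_{\max(a,1)}^{a+1} \log^2(u)\, du \\
&\le \int_{0}^{1} \log^2(u)\, du + \int_{\max(a,1)}^{a+1} \log^2(u)\, du \\
&\le \big[ u \log^2(u) - 2u \log(u) + 2u \big]_{0}^{1} + \log^2(a + 1) \\
&= 2 + \log^2(a + 1).
\end{aligned}
\tag{50}
$$

We used the fact that $\log^2(u)$ is increasing for $u \ge 1$. Notice that the bound is trivial for $a \ge 1$ because $U \le 1$. Applying the bound to the squared elements in (48), and using $N_{t-1}(x_t) \le t - 1$:

$$
\mathbb{E}\left[ \log^2\left( \frac{N_{t-1}(x_t)}{h_t} + u_t(x_t) \right) \right]
\le 2 + \log^2\left( \frac{N_{t-1}(x_t)}{h_t} + 1 \right)
\le 2 + \log^2\left( \frac{t-1}{h_t} + 1 \right).
\tag{51}
$$

For the second term, the expectation is first applied only to one arbitrary element $u_t(x)$, while conditioning on the other elements:

$$
\begin{aligned}
\mathbb{E}\left[ \log^2\left( \frac{t-1}{h_t} + \sum_{x' \in \mathcal{X}} u_t(x') \right) \right]
&= \mathbb{E}\left\{ \mathbb{E}\left[ \log^2\left( \frac{t-1}{h_t} + \sum_{x' \in \mathcal{X}} u_t(x') \right) \,\middle|\, u_t(x'),\, x' \ne x \right] \right\} \\
&\le \mathbb{E}\left\{ 2 + \log^2\left( \frac{t-1}{h_t} + \sum_{x' \ne x} u_t(x') + 1 \right) \right\} \\
&\le 2 + \log^2\left( \frac{t-1}{h_t} + |\mathcal{X}| \right),
\end{aligned}
\tag{52}
$$

where we used again the fact that $\log^2(u)$ is increasing for $u \ge 1$. Combining (48) with (49) and the bounds above yields:

$$
\begin{aligned}
\sigma_t^2 \triangleq \mathrm{Var}(\gamma_t)
&= \mathrm{Var}\left[ \log\big(P_t^{(u)}(x_t)\big) \right] \\
&\le 4 + 2 \log^2\left( \frac{t-1}{h_t} + 1 \right) + 4 + 2 \log^2\left( \frac{t-1}{h_t} + |\mathcal{X}| \right) \\
&\le 8 + 4 \log^2\left( \frac{t-1}{h_t} + |\mathcal{X}| \right).
\end{aligned}
\tag{53}
$$

Under the assumptions of Theorem 1, $h_t = h_1 \cdot t^{\alpha}$ with $\alpha \in (0, 1)$, and so $\sigma_t^2 = O(\log^2(t))$ and clearly Kolmogorov's criterion holds. □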
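The bound (50) and the resulting summability can be checked numerically. The sketch below is illustrative only: the helper names, the midpoint quadrature, and the parameter values $h_1 = 1$, $\alpha = 1/2$, $|\mathcal{X}| = 3$ are assumptions chosen for the demonstration.

```python
import math


def second_moment_log(a, steps=20000):
    """Midpoint-rule approximation of E[log^2(a + U)], U uniform on [0, 1]."""
    return sum(math.log(a + (k + 0.5) / steps) ** 2 for k in range(steps)) / steps


def bound_50(a):
    """Right-hand side of (50): 2 + log^2(a + 1)."""
    return 2.0 + math.log(a + 1.0) ** 2


def variance_bound_53(t, h_t, alphabet_size):
    """Right-hand side of (53): 8 + 4 log^2((t-1)/h_t + |X|)."""
    return 8.0 + 4.0 * math.log((t - 1) / h_t + alphabet_size) ** 2


def kolmogorov_partial_sum(n, h1, alpha, alphabet_size):
    """Partial sum of variance_bound_53 / t^2 for h_t = h1 * t**alpha."""
    return sum(
        variance_bound_53(t, h1 * t ** alpha, alphabet_size) / t ** 2
        for t in range(1, n + 1)
    )


moments = {a: second_moment_log(a) for a in (0.0, 0.1, 1.0, 5.0)}
s_10k = kolmogorov_partial_sum(10_000, h1=1.0, alpha=0.5, alphabet_size=3)
s_20k = kolmogorov_partial_sum(20_000, h1=1.0, alpha=0.5, alphabet_size=3)
```

Each entry of `moments` stays below `bound_50(a)`, and the partial sums `s_10k` and `s_20k` differ only slightly, which is consistent with the series in Kolmogorov's criterion converging.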



